EXHIBIT B 



To: Jim Kahle/Austin/IBM@IBMUS

cc: Charles Moore/Austin/IBM@IBMUS, Brian Konigsburg/Austin/IBM@IBMUS

From: Balaram Sinharoy/Poughkeepsie/IBM@IBMUS

Subject: IBM Confidential: Simplify GBH Update Logic?



Hi Jim, 

If we want, there is a simplification we can make in the Global Branch History (GBH) update
logic, but it has a performance impact for scientific code. This simplification works only
when we do not have a BTAC. In the absence of any exception to normal instruction fetching,
we always fetch the next sequential sector.

An alternative (and much simpler) approach to updating the GBH vector is to shift in a "0" when
we fetch the next sequential sector. On a BHT redirect (that is, there is a taken branch in the
fetch group), we back up the GBH vector and then shift in a "1" (logic diagram below). GBH
update for handling other exception cases remains similar.
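
As a rough illustration of the update rule just described, here is a minimal Python sketch. It is
not the actual logic. The class name, the per-fetch-group tags, and the checkpoint dictionary are
all hypothetical, and it assumes "back up" means restoring the vector to the value it had just
before the redirecting fetch group shifted in its speculative "0".

```python
class SimpleGBH:
    """Hypothetical model of the simpler GBH update (illustration only)."""

    def __init__(self, bits=11):
        self.bits = bits
        self.mask = (1 << bits) - 1
        self.value = 0
        # tag -> GBH value just before that fetch group's bit was shifted in
        self.checkpoints = {}

    def fetch_sequential(self, tag):
        """Next sequential sector fetched: remember the old value, shift in 0."""
        self.checkpoints[tag] = self.value
        self.value = (self.value << 1) & self.mask

    def bht_redirect(self, tag):
        """Fetch group `tag` had a taken branch: back up, then shift in 1."""
        restored = self.checkpoints[tag]
        self.value = ((restored << 1) | 1) & self.mask
        # Checkpoints of this group and of younger groups are no longer valid.
        self.checkpoints = {t: v for t, v in self.checkpoints.items() if t < tag}

    def as_string(self):
        return format(self.value, "0{}b".format(self.bits))


gbh = SimpleGBH()
gbh.fetch_sequential(0)
gbh.bht_redirect(0)        # group 0 had a taken branch
gbh.fetch_sequential(1)
gbh.fetch_sequential(2)    # fetched before the redirect for group 1 is known
gbh.bht_redirect(1)        # discards the 0s from groups 1 and 2, shifts in 1
print(gbh.as_string())
```

The only state kept besides the vector is the set of saved values, which is consistent with the
point that this approach needs no state machine.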

Functionally, the major difference between the POR and the simpler approach is that the simpler
approach shifts in a "0" even if the fetch group does not have any branch in it. This lowers
performance for scientific code. Another difference is that there is no two-cycle delayed update
of the GBH vector (as in the POR). The two-cycle delayed update has little performance impact, but
because of it the POR has more complex control and requires a moderate-sized state machine. The
simpler approach has no state machine.
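
To make the functional difference concrete, the sketch below produces the bits each approach
would shift into the GBH vector for the same stream of fetch groups. It assumes, based on the
description above, that the POR shifts a bit only for fetch groups that contain a branch. The
two-cycle update delay is ignored, and the function name and the (has_branch, taken) tuples are
hypothetical.

```python
def gbh_bits(fetch_groups, simpler):
    """Return the bits shifted into the GBH vector, oldest first.

    fetch_groups: list of (has_branch, taken) tuples, one per fetch group.
    simpler: True for the simpler approach, False for the assumed POR rule.
    """
    bits = []
    for has_branch, taken in fetch_groups:
        if has_branch:
            bits.append("1" if taken else "0")
        elif simpler:
            # The simpler approach shifts in a 0 even without a branch.
            bits.append("0")
    return "".join(bits)


# Five loop iterations of three fetch groups each; only the last group of
# each iteration has a branch, and it is taken once in five iterations.
iteration = [(False, False), (False, False), (True, False)]
last_iteration = [(False, False), (False, False), (True, True)]
stream = iteration * 4 + last_iteration

print(gbh_bits(stream, simpler=False))
print(gbh_bits(stream, simpler=True))
```

With this stream the assumed POR rule records 5 bits, while the simpler rule records 15, most of
them zeros that carry no branch information.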



I did some performance runs on the M2 model to compare the two approaches. Performance of the two
approaches is about the same for commercial workload. However, the alternate approach is
substantially worse for scientific workload. For example, for APPLU (a partial differential
equation solver, quite common in scientific code) we lose 5% CPI.

APPLU (and other scientific code) typically has nested loops like the following, for which the
simpler algorithm does not work well.

do i = 1 to 100000 (a large number)
    do j = 1 to 5 (a low number)
        number of matrix equations (but no branch)
    enddo
enddo

After two iterations of the outer loop, our POR GBH update will be able to predict the inner
loop branch with near 100% accuracy. This is because our algorithm learns that only when the
11-bit GBH vector is 01000010000 will we get another 1 (or taken). In all other cases it will
predict fall-through. (We lose this advantage if the inner loop iterates more than 12 times.)
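
The 12-iteration limit can be checked with a small sketch. It assumes the history stream implied
by the vector above: one "1" per inner-loop pass, preceded by zeros, one bit per iteration. The
function name and parameters are hypothetical. It tests whether an n-bit history always
determines the next outcome.

```python
def history_determines_outcome(period, history_bits, repeats=50):
    """True if every full history window is always followed by the same bit.

    The stream is (period - 1) zeros followed by a single 1, repeated.
    """
    stream = ([0] * (period - 1) + [1]) * repeats
    seen = {}
    # Start once a full history window is available.
    for i in range(history_bits, len(stream)):
        history = tuple(stream[i - history_bits:i])
        outcome = stream[i]
        if seen.setdefault(history, outcome) != outcome:
            return False  # same history, different outcomes: not predictable
    return True


for period in (5, 12, 13):
    print(period, history_determines_outcome(period, history_bits=11))
```

With 11 bits, a period of 13 leaves two positions that both see an all-zero history, one followed
by a taken branch and one not, so the history can no longer tell them apart.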

The simpler approach dilutes the GBH vector with too many "0"s (for which there is no branch). If
the loop body above has 3 fetch groups, the simpler approach will shift in 3 bits per inner loop
iteration. Since five iterations form a repetitive cycle, to remember all five iterations
we will need a 5*3 = 15-bit GBH vector. With an 11-bit GBH vector (POR), we will mispredict 2 out of
every 5 iterations (with a 1-bit predictor, each mispredict sets the BHT to the wrong value,
causing another mispredict). In the worst case, when there are too many fetch groups in the
loop body (but no branch), we can completely dilute the GBH vector with "0"s and end up with
a result similar to just LBHT (which is quite poor for such loops). APPLU has such loop
bodies (see example below, which is also one of the most frequently executed subroutines).
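
The dilution effect can be reproduced with a toy simulation. This is a sketch under stated
assumptions, not the M2 model. It uses a table of 1-bit predictors indexed only by the GBH vector,
an inner loop of 5 iterations whose branch is taken once per pass, and branch-free fetch groups
that precede the branch in each iteration. All names are hypothetical.

```python
def count_mispredicts(history_bits, groups_per_iter, dilute,
                      outer_iters=1000, inner_iters=5):
    """Count inner-loop branch mispredicts for a GBH-indexed 1-bit predictor.

    dilute=True models the simpler approach: each branch-free fetch group
    shifts a 0 into the GBH vector. dilute=False shifts only branch outcomes.
    """
    mask = (1 << history_bits) - 1
    gbh = 0
    table = {}          # GBH value -> last outcome seen (1-bit predictor)
    mispredicts = 0
    for _ in range(outer_iters):
        for j in range(inner_iters):
            if dilute:
                for _ in range(groups_per_iter - 1):
                    gbh = (gbh << 1) & mask      # 0 for a branch-free group
            taken = (j == inner_iters - 1)       # one taken branch per pass
            predicted = table.get(gbh, False)    # default: fall-through
            if predicted != taken:
                mispredicts += 1
            table[gbh] = taken                   # 1-bit update
            gbh = ((gbh << 1) | int(taken)) & mask
    return mispredicts


print("11-bit, no dilution:", count_mispredicts(11, 3, dilute=False))
print("11-bit, diluted:    ", count_mispredicts(11, 3, dilute=True))
print("15-bit, diluted:    ", count_mispredicts(15, 3, dilute=True))
```

Under these assumptions the undiluted 11-bit vector mispredicts only during warm-up, the diluted
11-bit vector settles at two mispredicts per five iterations, and widening the diluted vector to
15 bits removes the steady-state mispredicts again.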

For commercial code, the simpler approach works well. Commercial code is very branch-oriented.
In 11 successive fetch groups (1 bit in GBH per fetch group), there is almost always some
branch. So if the prediction is path dependent, we almost never lose the path information.
Moreover, the simpler algorithm might actually help in cases when we do not really need
information from all the 11 previous branches. Some dilution in the GBH vector can actually
reduce the same information from appearing in too many entries in the GBHT/GSEL tables.
That means better use of the GBHT/GSEL tables and slightly better performance.



Thanks. 

Regards, 
Balaram 



Performance Comparison (from Model2 ver 1.b2)

(POR numbers are very close to what we expected from Malic performance modeling)

Logic Diagram for Simpler Approach